Note

This page will display what superintendent widgets look like, but not respond to user input (as it’s not connected to a backend).

Distributing labelling across multiple people

One of the main challenges of labelling data is that it can take a lot of time.

To get around this, many people want to distribute the task across multiple people, potentially even outsourcing it to a crowd platform. This is difficult to do with a standard in-memory Python object.

In superintendent, you can get around this by using the superintendent.distributed submodule. Its labelling widgets replicate the widgets in the main superintendent module, but use a database to store the “queue” of objects, as well as the results of the labelling.

The distributed submodule stores and retrieves data from a SQL database, serialising / deserialising it along the way. You simply pass your data in the same way you do with superintendent widgets, and can retrieve the labels in the same way. In theory, other than having to set up the database, everything else should be the same.
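To make the idea of a database-backed queue concrete, here is a minimal sketch using only Python's built-in sqlite3 and json modules. It is not superintendent's implementation: the table name and the helper functions (create_queue, enqueue, pop, submit, labels) are hypothetical, and only the column names are borrowed from the row printed later on this page.

```python
import json
import sqlite3
from datetime import datetime, timezone


def _now():
    # ISO-8601 timestamp in UTC, stored as text.
    return datetime.now(timezone.utc).isoformat()


def create_queue(conn):
    # Hypothetical table; column names mirror the row shown later on this page.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS queue (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            input TEXT NOT NULL,
            output TEXT,
            inserted_at TEXT NOT NULL,
            popped_at TEXT,
            completed_at TEXT
        )
        """
    )


def enqueue(conn, item):
    # Serialise the item to JSON before storing it.
    conn.execute(
        "INSERT INTO queue (input, inserted_at) VALUES (?, ?)",
        (json.dumps(item), _now()),
    )


def pop(conn):
    # Fetch the oldest item nobody has popped yet and mark it as popped.
    row = conn.execute(
        "SELECT id, input FROM queue WHERE popped_at IS NULL ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE queue SET popped_at = ? WHERE id = ?", (_now(), row[0]))
    return row[0], json.loads(row[1])


def submit(conn, item_id, label):
    # Store the label (also as JSON) next to the item it belongs to.
    conn.execute(
        "UPDATE queue SET output = ?, completed_at = ? WHERE id = ?",
        (json.dumps(label), _now(), item_id),
    )


def labels(conn):
    # Deserialise every label that has been submitted so far.
    rows = conn.execute(
        "SELECT output FROM queue WHERE output IS NOT NULL ORDER BY id"
    ).fetchall()
    return [json.loads(r[0]) for r in rows]


conn = sqlite3.connect(":memory:")
create_queue(conn)
for features in ([0.0, 5.0, 13.0], [1.0, 12.0, 7.0]):
    enqueue(conn, features)

item_id, features = pop(conn)
submit(conn, item_id, "0")
queue_labels = labels(conn)
```

Note that the select-then-update in pop is not safe when several processes pop at once; a real multi-user queue needs the locking that a database server provides.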

Warning

For demonstration purposes, this example uses an SQLite file as a database. However, this is unsuitable for real distribution of labelling: an SQLite file on a shared file-system will break. In production, a database server is recommended (in the past, superintendent has been used with PostgreSQL).

The use case ultimately looks a bit like this:

distributed diagram

This allows you to ask your colleagues to label data for you. By separating the labelling process from the active learning process, it also allows you to scale the compute that does the active learning: for example, you can use a server with GPUs to train complex models, while the labelling user can just use a laptop.

Ultimately, the database architecture also means that you have more persistent storage, and are more robust to crashes.

Distributing the labelling of images across people

superintendent uses SQLAlchemy to communicate with the database; all you need to provide is a “connection URL”.
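As a rough illustration of what such connection URLs look like, the sketch below builds one for a PostgreSQL server next to the SQLite one used in this example. The helper make_postgres_url, the user name, password, host and database name are all made up for illustration; the URL-escaping of special characters follows the approach SQLAlchemy recommends for passwords.

```python
from urllib.parse import quote_plus


def make_postgres_url(user, password, host, database, port=5432):
    """Build a SQLAlchemy-style PostgreSQL URL, escaping special characters."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# The SQLite URL used on this page: a file called demo.db in the working directory.
sqlite_url = "sqlite:///demo.db"

# A hypothetical PostgreSQL server; every value here is a placeholder.
postgres_url = make_postgres_url(
    "labeller", "p@ss/word", "db.example.com", "labelling"
)
```

Either string could then be passed as the connection_string argument, assuming the matching database driver is installed.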

First, we make sure that we are using a completely fresh database:

In [1]:
import os
if os.path.isfile("demo.db"):
    os.remove("demo.db")
In [2]:
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits().data

In [3]:
from superintendent.distributed import SemiSupervisor

widget = SemiSupervisor.from_images(
    connection_string="sqlite:///demo.db",
    options=range(10)
)

We can then add data to the database. Because every widget that connects to the database shares the same data, we should only run this code once:

In [4]:
widget.add_features(digits[:1000, :])

We can then start labelling data:

In [4]:
widget

You can inspect the database by using the widget.queue attribute, which encapsulates the database connection and the methods for retrieving and submitting data.

In [5]:
with widget.queue.session() as session:
    print(session.query(widget.queue.data).count())
1000
In [6]:
from pprint import pprint

with widget.queue.session() as session:
    pprint(session.query(widget.queue.data).first().__dict__)
{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1a213b2b70>,
 'completed_at': datetime.datetime(2018, 11, 12, 11, 25, 8, 593477),
 'id': 1,
 'input': '{"__type__": "__np.ndarray__", "__content__": [0.0, 0.0, 5.0, 13.0, '
          '9.0, 1.0, 0.0, 0.0, 0.0, 0.0, 13.0, 15.0, 10.0, 15.0, 5.0, 0.0, '
          '0.0, 3.0, 15.0, 2.0, 0.0, 11.0, 8.0, 0.0, 0.0, 4.0, 12.0, 0.0, 0.0, '
          '8.0, 8.0, 0.0, 0.0, 5.0, 8.0, 0.0, 0.0, 9.0, 8.0, 0.0, 0.0, 4.0, '
          '11.0, 0.0, 1.0, 12.0, 7.0, 0.0, 0.0, 2.0, 14.0, 5.0, 10.0, 12.0, '
          '0.0, 0.0, 0.0, 0.0, 6.0, 13.0, 10.0, 0.0, 0.0, 0.0]}',
 'inserted_at': datetime.datetime(2018, 11, 12, 11, 25, 5, 301340),
 'output': '"0"',
 'popped_at': datetime.datetime(2018, 11, 12, 11, 25, 5, 470387),
 'priority': None,
 'worker_id': None}

As you can see, superintendent added our entries into the database. The format of this row is not necessarily important, as you can retrieve the data needed using superintendent itself.
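If you ever do need to read such a row by hand, the printed output above suggests that both columns hold JSON. The sketch below decodes them with the standard library only. It is based purely on the format visible in that output, not on superintendent's own deserialiser, and the helpers decode_array and to_rows are hypothetical. The input string is a shortened version of the real one.

```python
import json


def decode_array(serialised):
    """Return the flat list of values stored in a serialised array."""
    payload = json.loads(serialised)
    if payload.get("__type__") != "__np.ndarray__":
        raise ValueError("not a serialised numpy array")
    return payload["__content__"]


def to_rows(values, width=8):
    """Split a flat list into rows, e.g. 64 pixel values into an 8x8 image."""
    return [values[i:i + width] for i in range(0, len(values), width)]


# Shortened copies of the 'input' and 'output' columns printed above.
stored_input = (
    '{"__type__": "__np.ndarray__", '
    '"__content__": [0.0, 0.0, 5.0, 13.0, 9.0, 1.0, 0.0, 0.0]}'
)
stored_output = '"0"'

pixels = decode_array(stored_input)
label = json.loads(stored_output)
```

For a full 64-value digit, to_rows(pixels) would give the 8x8 grid of pixel intensities.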

Retrieving data from the distributed widget

Any superintendent widget connected to the database can retrieve the labels using widget.new_labels:

In [7]:
pprint(widget.new_labels[:30])
['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9']
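The labels come back as strings, so a little post-processing is usually needed before training a model on them. The following sketch uses a hard-coded list in place of widget.new_labels. It assumes that items without a label yet would show up as None, which the output above does not confirm.

```python
from collections import Counter

# Stand-in for widget.new_labels; the trailing None is an assumed
# placeholder for an item nobody has labelled yet.
new_labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", None]

# Keep (index, integer label) pairs for the items that have a label.
labelled = [
    (index, int(label))
    for index, label in enumerate(new_labels)
    if label is not None
]

# Count how often each class was assigned, to spot class imbalance.
counts = Counter(label for _, label in labelled)
```

The indices in labelled can be used to select the matching rows of the feature array.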

Doing active learning during distributed labelling

One of the great benefits of using the distributed submodule is that you can perform active learning, where the labelling of data and the training of the active learning model are split across different machines. You can achieve this by creating a widget object that you don’t intend to use for labelling - only for orchestration of labelling by others:

In [8]:
from sklearn.linear_model import LogisticRegression

widget = SemiSupervisor(
    connection_string="sqlite:///demo.db",
    classifier=LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=5000),
    reorder='margin'
)
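The reorder='margin' argument presumably refers to margin sampling, a common active learning strategy: items where the model's two most likely classes are closest in probability get labelled first. The sketch below shows the general technique in plain Python. The functions margin and order_by_margin are illustrative, and superintendent's own implementation may differ in detail.

```python
def margin(probabilities):
    """Difference between the two highest class probabilities."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return top - second


def order_by_margin(predicted):
    """Return item indices, most uncertain (smallest margin) first."""
    return sorted(range(len(predicted)), key=lambda i: margin(predicted[i]))


# Made-up class probabilities for three unlabelled items.
predicted = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # uncertain prediction, small margin
    [0.60, 0.30, 0.10],  # somewhere in between
]
order = order_by_margin(predicted)
```

Here the second item would be shown to a labeller first, because the model can barely decide between its top two classes.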

Note

By default, the orchestration runs forever. You might not want this: for example, you might prefer to run the orchestration on a schedule using cron. You can do that by passing None as the interval_seconds keyword argument, as in the cell below.

In either case, the orchestration is best run from a Python script on the command line, rather than from a Jupyter notebook.

In [9]:
widget.orchestrate(interval_seconds=None)
Score: 0.94